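Finding word boundaries in continuous speech is challenging because there is little or no equivalent of a "space" delimiter between words. Popular Bayesian non-parametric models for text segmentation use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but relies only on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new state of the art in speech segmentation in 5 languages. The algorithm improves monotonically with better input representations, achieving even higher scores when fed weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations, as assessed by a new spoken word embedding benchmark.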
translated by 谷歌翻译
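We introduce the largest dataset for Arabic speech mispronunciation detection in the Egyptian dialect. The dataset consists of annotated audio files representing the 100 most frequently used words in Arabic, pronounced by 100 Egyptian children (aged between 2 and 8 years). The dataset was collected and annotated for segmental pronunciation error detection by expert listeners.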
This paper proposes a novel observer-based controller for a Vertical Take-Off and Landing (VTOL) Unmanned Aerial Vehicle (UAV), designed to directly receive measurements from a Vision-Aided Inertial Navigation System (VA-INS) and produce the required thrust and rotational torque inputs. The VA-INS is composed of a vision unit (monocular or stereo camera) and a typical low-cost 6-axis Inertial Measurement Unit (IMU) equipped with an accelerometer and a gyroscope. A major benefit of this approach is its applicability to environments where the Global Positioning System (GPS) is inaccessible. The proposed VTOL-UAV observer uses IMU and feature measurements to accurately estimate attitude (orientation), gyroscope bias, position, and linear velocity. The ability to use VA-INS measurements directly makes the proposed observer design more computationally efficient, as it obviates the need for attitude and position reconstruction. Once the motion components are estimated, the observer-based controller is used to control the VTOL-UAV attitude, angular velocity, position, and linear velocity, guiding the vehicle along the desired trajectory in six degrees of freedom (6 DoF). The closed-loop estimation and control errors of the observer-based controller are proven to be exponentially stable starting from almost any initial condition. To achieve a global and unique VTOL-UAV representation in 6 DoF, the proposed approach is posed on the Lie group and the design in unit-quaternion form is presented. Although the proposed approach is described in continuous form, a discrete version is provided and tested. Keywords: vision-aided inertial navigation system, unmanned aerial vehicle, vertical take-off and landing, stochastic, noise, robotics, control systems, air mobility, observer-based controller algorithm, landmark measurement, exponential stability.
Solute transport in porous media is relevant to a wide range of applications in hydrogeology, geothermal energy, underground CO2 storage, and a variety of chemical engineering systems. Due to the complexity of solute transport in heterogeneous porous media, traditional solvers require high-resolution meshing and are therefore computationally expensive. This study explores the application of a mesh-free method based on deep learning to accelerate the simulation of solute transport. We employ Physics-informed Neural Networks (PiNN) to solve solute transport problems in homogeneous and heterogeneous porous media governed by the advection-dispersion equation. Unlike traditional neural networks that learn from large training datasets, PiNNs leverage only the strong-form mathematical models to simultaneously solve for multiple dependent or independent field variables (e.g., pressure and solute concentration fields). In this study, we construct the PiNN using a periodic activation function to better represent the complex physical signals (i.e., pressure) and their derivatives (i.e., velocity). Several case studies are designed to investigate the proposed PiNN's capability to handle different degrees of complexity. A manual hyperparameter tuning method is used to find the best PiNN architecture for each test case. Point-wise error and mean square error (MSE) measures are employed to assess the PiNN's predictions against ground truth solutions obtained analytically or numerically using the finite element method. Our findings show that the PiNN's predictions are in good agreement with the ground truth solutions while reducing computational complexity and cost by at least three orders of magnitude.
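To make the idea of a physics residual concrete, the sketch below evaluates the residual of a 1D advection-dispersion equation, c_t + v c_x - D c_xx = 0, for a candidate concentration field. This is only an illustration under stated assumptions: the constant-coefficient 1D form, the values of `v` and `D`, and the function names are all hypothetical, and derivatives are approximated with central finite differences, whereas a physics-informed network would typically obtain them by automatic differentiation.

```python
import math


def concentration(x, t, v=1.0, D=0.1):
    """Analytical solution for an instantaneous unit point source
    released at x = 0, t = 0 in an infinite 1D domain (assumed setup)."""
    return math.exp(-(x - v * t) ** 2 / (4 * D * t)) / math.sqrt(4 * math.pi * D * t)


def ade_residual(c, x, t, v=1.0, D=0.1, h=1e-4):
    """Residual c_t + v*c_x - D*c_xx of a field c(x, t), using
    central finite differences with step h."""
    c_t = (c(x, t + h) - c(x, t - h)) / (2 * h)
    c_x = (c(x + h, t) - c(x - h, t)) / (2 * h)
    c_xx = (c(x + h, t) - 2 * c(x, t) + c(x - h, t)) / (h * h)
    return c_t + v * c_x - D * c_xx


# The analytical solution gives a residual close to zero; an arbitrary
# field such as c(x, t) = x**2 does not.
print(ade_residual(concentration, 1.2, 1.0))
print(ade_residual(lambda x, t: x * x, 1.2, 1.0))
```

A training loss in this style would average the squared residual over sampled points and add terms for the initial and boundary conditions.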
Vulnerability to adversarial attacks is a well-known weakness of Deep Neural Networks. While most studies focus on natural images with standardized benchmarks like ImageNet and CIFAR, little research has considered real-world applications, in particular in the medical domain. Our research shows that, contrary to previous claims, the robustness of chest x-ray classification is much harder to evaluate and leads to very different assessments depending on the dataset, the architecture, and the robustness metric. We argue that previous studies did not take into account the peculiarities of medical diagnosis, such as the co-occurrence of diseases, the disagreement of labellers (domain experts), the threat model of the attacks, and the risk implications of each successful attack. In this paper, we discuss the methodological foundations, review the pitfalls and best practices, and suggest new methodological considerations for evaluating the robustness of chest x-ray classification models. Our evaluation on 3 datasets, 7 models, and 18 diseases is the largest evaluation of the robustness of chest x-ray classification models.
Owing to the success of transformer models, recent works study their applicability to 3D medical segmentation tasks. Within transformer models, the self-attention mechanism is one of the main building blocks and strives to capture long-range dependencies, in contrast to the local convolution-based design. However, the self-attention operation has quadratic complexity, which proves to be a computational bottleneck, especially in volumetric medical imaging, where the inputs are 3D with numerous slices. In this paper, we propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks and efficiency in terms of parameters and compute cost. The core of our design is a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features using a pair of inter-dependent branches based on spatial and channel attention. Our spatial attention formulation is efficient, having linear complexity with respect to the input sequence length. To enable communication between the spatial and channel-focused branches, we share the weights of the query and key mapping functions, which provides a complementary benefit (paired attention) while also reducing the overall network parameters. Our extensive evaluations on three benchmarks, Synapse, BTCV and ACDC, reveal the effectiveness of the proposed contributions in terms of both efficiency and accuracy. On the Synapse dataset, our UNETR++ sets a new state of the art with a Dice Similarity Score of 87.2%, while being significantly more efficient, with a reduction of over 71% in both parameters and FLOPs compared to the best existing method in the literature. Code: https://github.com/Amshaker/unetr_plus_plus.
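For readers unfamiliar with the Dice Similarity Score reported above, the following is a minimal sketch of its usual definition for binary masks, 2|A ∩ B| / (|A| + |B|). The function name, the flat-list mask representation, and the convention of returning 1.0 for two empty masks are assumptions made for illustration; the abstract does not specify how the score is averaged over organs or cases.

```python
def dice_score(pred, target):
    """Dice similarity between two binary masks given as flat
    sequences of 0/1 values of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels in each mask, summed.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# One overlapping voxel, two predicted and one true foreground voxel.
print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
```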
Automatic speech recognition research focuses on training and evaluating on static datasets. Yet, as speech models are increasingly deployed on personal devices, such models encounter user-specific distributional shifts. To simulate this real-world scenario, we introduce LibriContinual, a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks, with data corresponding to 118 individual speakers and 6 train splits of different sizes per speaker. Additionally, current speech recognition models and continual learning algorithms are not optimized to be compute-efficient. We adapt a general-purpose training algorithm, NetAug, for ASR and create a novel Conformer variant called the DisConformer (Disentangled Conformer). This algorithm produces ASR models consisting of a frozen 'core' network for general-purpose use and several tunable 'augment' networks for speaker-specific tuning. Using such models, we propose a novel compute-efficient continual learning algorithm called DisentangledCL. Our experiments show that the DisConformer models significantly outperform baselines on general ASR, i.e., LibriSpeech (15.58% relative WER on test-other). On speaker-specific LibriContinual, they significantly outperform trainable-parameter-matched baselines (by 20.65% relative WER on test) and even match fully finetuned baselines in some settings.
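The relative figures above can be read with a small worked example. The sketch below assumes that "rel. WER" means the relative reduction in word error rate with respect to a baseline, which the abstract does not spell out; the function name and the input values are hypothetical and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer, model_wer):
    """Relative WER reduction in percent: how much lower the model's
    WER is, expressed as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - model_wer) / baseline_wer


# Hypothetical values: a baseline at 10.0% WER and a model at 8.0% WER
# correspond to a 20% relative reduction (2 points absolute).
print(relative_wer_reduction(10.0, 8.0))
```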
HTR model development has become a conventional step for digital humanities projects. The performance of these models, often quite high, relies on manual transcription and numerous handwritten documents. Although the method has proven successful for Latin scripts, a similar amount of data is not yet achievable for scripts considered poorly endowed, like Arabic scripts. In that respect, we introduce and assess a new modus operandi for HTR model development and fine-tuning dedicated to the Arabic Maghribī scripts. The comparison between several state-of-the-art HTR systems demonstrates the relevance of a word-based neural approach specialized for Arabic, capable of achieving an error rate below 5% with only 10 manually transcribed pages. These results open new perspectives for Arabic script processing and, more generally, for the processing of poorly endowed languages. This research is part of the development of the RASAM dataset in partnership with the GIS MOMM and the BULAC.
End-to-end multilingual ASR has become more appealing for several reasons, such as simplifying the training and deployment process and positive performance transfer from high-resource to low-resource languages. However, scaling up the number of languages, total hours, and number of unique tokens is not a trivial task. This paper explores large-scale multilingual ASR models on 70 languages. We inspect two architectures: (1) a shared embedding and output model and (2) a multiple embedding and output model. In the shared model experiments, we show the importance of the tokenization strategy across different languages. We then use our optimal tokenization strategy to train a multiple embedding and output model to further improve our results. Our multilingual ASR achieves a 13.9%-15.6% average relative WER improvement compared to monolingual models. We show that our multilingual ASR generalizes well to an unseen dataset and domain, achieving 9.5% and 7.5% WER on Multilingual Librispeech (MLS) with zero-shot and finetuning, respectively.
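Since the results above are stated in terms of WER, a minimal sketch of the standard word error rate may help: the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the number of reference words. Whitespace tokenization and the function name are assumptions for illustration; real evaluations usually apply text normalization first, and the abstract does not describe the scoring setup used.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent, computed from the word-level Levenshtein
    distance between a reference and a hypothesis transcript."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("a b c d", "a x c d"))
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.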
The emergence of COVID-19 has had a global and profound impact, not only on society as a whole, but also on the lives of individuals. Various prevention measures were introduced around the world to limit the transmission of the disease, including face masks, mandates for social distancing and regular disinfection in public spaces, and the use of screening applications. These developments also triggered the need for novel and improved computer vision techniques capable of (i) supporting the prevention measures through automated analysis of visual data and (ii) facilitating the normal operation of existing vision-based services, such as biometric authentication schemes. Especially important here are computer vision techniques that focus on the analysis of people and faces in visual data and that have been affected the most by the partial occlusions introduced by the mandates for facial masks. Such computer vision based human analysis techniques include face and face-mask detection approaches, face recognition techniques, crowd counting solutions, age and expression estimation procedures, models for detecting face-hand interactions, and many others, and have seen considerable attention in recent years. The goal of this survey is to provide an introduction to the problems that COVID-19 has introduced into such research and to present a comprehensive review of the work done in the field of computer vision based human analysis. Particular attention is paid to the impact of facial masks on the performance of various methods and to recent solutions that mitigate this problem. Additionally, a detailed review of existing datasets useful for developing and evaluating methods for COVID-19 related applications is provided. Finally, to help advance the field further, a discussion of the main open challenges and future research directions is given.